Research on performance variations of classifiers with the influence of pre-processing methods for Chinese short text classification

Text pre-processing is an important component of Chinese text classification. At present, however, most studies on this topic focus on the influence of preprocessing methods on a few text classification algorithms applied to English text. In this paper we experimentally compare fifteen commonly used classifiers on two Chinese datasets using three widely used Chinese preprocessing methods: word segmentation, Chinese specific stop word removal, and Chinese specific symbol removal. We then explore the influence of the preprocessing methods on the final classification results under various conditions, such as the evaluation metric, the combination style, and the classifier selected. Finally, we conduct a battery of additional experiments and find that most classifiers improve in performance after proper preprocessing is applied. Our general conclusion is that the systematic use of preprocessing methods has a positive impact on Chinese short text classification, across evaluation metrics such as macro-F1, combinations of preprocessing methods such as word segmentation with Chinese specific stop word and symbol removal, and classifier types ranging from machine learning to deep learning models. The best macro-F1 scores for the two datasets are 92.13% and 91.99%, which represent improvements of 0.3% and 2%, respectively, over the compared baselines.


Introduction
Text classification is the process of determining the category of a natural language text from a predefined category set, i.e., assigning predefined category tags to the text [1]. It is widely used in numerous fields, such as internet information filtering [2], question and answer topic classification [3], intelligent recommendation systems [4], sentiment analysis [5,6], and public opinion analysis [7].
Natural language processing is one of the most important components of text classification [8]. Chinese information processing uses the computer to process Chinese phonetic, shape, meaning, and other language information, including the input, output, recognition, conversion, compression, storage, retrieval, analysis, and understanding [9], as well as the generation of characters, words, phrases, sentences, and chapters [10]. Researchers within this field have begun to realize the importance of preprocessing in text classification tasks. Across multiple languages, different text preprocessing methods have emerged, and their impact on classification accuracy has been studied [11]. Holding a single text classification algorithm fixed, researchers have examined whether different text preprocessing methods have a positive or negative effect on the classification results. However, this kind of research primarily focuses on English text, with few Chinese text classifiers being considered. Most studies also conduct experiments on a single classification algorithm, ignoring the sensitivity of different models to different preprocessing methods. Based on an extensive survey, we therefore select three influential preprocessing methods for Chinese text: word segmentation, Chinese specific stop word removal, and Chinese specific symbol removal. The text classification algorithms applicable to Chinese are divided into three categories: simple machine learning algorithms, deep learning algorithms, and algorithms based on pretrained language models. We combine the three Chinese text preprocessing methods and run experiments on the different text classification algorithms, in order to explore more accurately how Chinese text preprocessing affects classification performance with respect to model sensitivity.
The rest of the paper is organized as follows: Section 2 states the research objective. Section 3 reviews related work on the influence of preprocessing methods on classification models. Section 4 explains the text preprocessing methods. Section 5 describes the datasets and models. Section 6 presents the experimental results in detail. Section 7 analyzes the experimental results. Section 8 concludes.

Research objective
The purpose of this study is to analyze and measure the effect of preprocessing techniques on the performance of Chinese text classification models. The overall workflow of this paper is shown in Fig 1.
We explore the influence of three widely used preprocessing methods for Chinese text, namely word segmentation (WS), Chinese specific stop word removal (CSSWR), and Chinese specific symbol removal (CSSR), applied both separately and in combination. All possible combinations of the preprocessing methods are considered for fifteen different classifiers, covering machine learning classifiers, deep learning classifiers, and pretrained language model-based classifiers. We then discuss how much each preprocessing method contributes to successful Chinese classification, how the methods interact with one another, and how well they suit each type of model. To clarify how this work differs from previous studies, the preprocessing methods analyzed and the experimental conditions are summarized in Table 1. The experimental conditions include the preprocessing methods, the language, the classifiers considered, and the results.
The contributions of this paper are as follows: • We explore combinations along three axes: preprocessing methods (word segmentation, Chinese specific stop word removal, and Chinese specific symbol removal), feature extraction methods (TF-IDF, word2vec, and pretrained language models), and classifier selection (machine learning and deep learning models), in both the general and tax domains.
• Experiments are conducted comparatively on two datasets: THUCNews and tax questions.
• We identify the best combination of preprocessing methods and classification models for THUCNews and the tax questions: FastText with WS, CSSWR, and CSSR, and ChineseBERT with CSSWR and CSSR, respectively. Their macro-F1 scores are 92.13% and 91.99%, improvements of 0.3% and 2%, respectively.

Related work
For the Chinese text classification task, researchers have explored many algorithms designed to resolve semantic understanding problems such as text representation, feature selection, and extraction [23]. From these achievements in Chinese text classification, we identify three common types of models: machine learning, deep learning, and networks based on pretrained language models.
Common simple machine learning models used for Chinese text classification are SVM [24][25][26], K-nearest Neighbor [27], Random Forest [28], and Naïve Bayes [29]. To adapt to Chinese text, many improved simple machine learning models have emerged to boost classification performance; novel loss functions and the introduction of new features can add to the effects of such models [30,31]. Although such classification models are simple and effective, their accuracy is limited, and they are not the first choice for actual industrial applications.
Deep learning models have been popular for years, and many researchers have explored techniques that resolve the Chinese classification problem with improved results [32]. As a result, Chinese text features have become the key research point for improving performance. Character-, word-, and sentence-level features clearly represent the semantic information of Chinese text. For example, RAFG [33] applies a serialized BLSTM structure to model the sequence characteristics of Chinese text. A similar BiGRU structure has also been considered, which combines Chinese grammar rules in the form of constraints and simulates the linguistic functions of the target sentence by standardizing the output of adjacent positions [34]. Additionally, a hybrid attention network applies CNN and RNN to extract the semantic features of Chinese text by capturing class-related attentive representations from word- and character-level features simultaneously [35]. Furthermore, the fusion of these features can also improve performance [36]. Researchers have explored multiple strategies to achieve better integration. For example, LSTM, BiLSTM, CNN, and TextCNN features are popular for extracting Chinese semantics [37,38]. WCAM [39] uses an attention model to integrate character-level and word-level features that represent the semantic relationships within Chinese text. Graph convolution networks also enhance the understanding of the diverse grammatical features of Chinese microblogs for emotion classification [40]. Although Chinese text classification models based on deep learning outperform those based on simple machine learning, they still have the limitation that the classification effect changes with the complexity of the model. To allow models to fully learn and comprehend the semantic characteristics of text, and thereby improve classification, pretrained language models have emerged. Pretrained language models are frequently applied in large-scale corpus scenarios. For example, researchers have used BERT and domain-specific corpora for traditional Chinese medicine clinical record classification with strong results [41]. Extensive experiments on a hybrid model that combines BiGRU and BERT have demonstrated that Chinese sentiment classification can generate new insights for future development [34]. ERNIE is a pretrained language model for Chinese corpora, and question classification has been resolved by combining the ERNIE pretraining model with feature fusion [42]. Specifically, RoBERTa [43] has been implemented and fine-tuned for Chinese text classification, and ChineseBERT [44] incorporates both Chinese character glyph and pinyin information. Although pretrained language models achieve good results on most text tasks, their training cost is extremely high and requires a large text corpus compared with the simple machine learning and deep learning models.
Among all the methods used to improve Chinese classification performance, text preprocessing is an essential step. In addition to optimizing the model structure, proper Chinese preprocessing methods can effectively improve performance. Many researchers have focused on exploring the effects of text preprocessing on text classification. However, scholars have mainly studied the influence of preprocessing methods on English text classification accuracy. For example, many English preprocessing methods have been explored, such as removing rare words [12], using regular expressions for blacklisted words [12], spelling correction [19], HTML tag removal [15,19], tokenization [45], PoS tagging [12,45], stemming [15,45], and removing Unicode strings and noise [15]. However, these works primarily studied the influence on machine and deep learning models such as SMO (a variant of SVM) [19] and the Multilayer Perceptron [45]. The influence of text preprocessing on Arabic emotion analysis has also been introduced [16]. At the same time, the influence of text preprocessing methods on Spanish and English text classification has attracted significant research attention [14].
The existing literature concentrates on English data and data in other languages. Most studies focus on the number of preprocessing methods and use a single type of feature extraction and classification model to conduct experiments and draw conclusions, ignoring the relationships between different feature extraction methods, classification models, and preprocessing methods [13,17,18,20,21]. Some common Chinese preprocessing methods include word segmentation [17,18], useless symbol and stop word removal [13,17,18,20], redundancy removal [21], dose normalization [17,18], standardization of units [17,18], incomplete data processing [13], standardization [13], deduplication [20], and word frequency statistics [18]. To explore the relationships among preprocessing methods, feature extraction methods, and classification models in different areas, we choose THUCNews [46] and tax questions as our research datasets. We thereby obtain the optimal combination of preprocessing methods, feature extraction, and classification models for the two datasets, providing a reference for the selection of preprocessing methods for other Chinese text categorization tasks in the future.

Preprocessing methods
The Chinese text preprocessing methods used in this paper are word segmentation (WS), Chinese specific stop word removal (CSSWR), and Chinese specific symbol removal (CSSR). Chinese word segmentation is a necessary step to process Chinese text into discrete word representations that a model can understand; without segmentation, the model struggles to comprehend continuous Chinese characters directly. Word segmentation can better express sentence structures and semantic information, enabling the model to understand Chinese text better and improve classification accuracy. In addition, Chinese stop words do not contribute to text classification tasks and increase computational costs during model training and prediction. Removing stop words thus reduces the dimensionality of the feature space, contributing to improved model generalization ability and efficiency. Furthermore, Chinese special characters usually do not carry semantic information, and removing them purifies the text data, enhancing classification accuracy and stability. Finally, data cleaning ensures input data quality, improving the model's robustness and generalization ability. Therefore, we choose these three typical preprocessing methods to explore their influence on Chinese text classification performance.
For Chinese text preprocessing, WS is an indispensable step. It is the process of regrouping successive character sequences into words according to certain norms. In English writing, a space serves as a natural delimiter between words; Chinese words, sentences, and paragraphs cannot be delimited this way because Chinese words have no explicit boundaries. English words, by contrast, have abundant inflections, and to cope with these transformations English NLP has unique processing steps, lemmatization and stemming, that Chinese does not need; problems also exist in the division of English phrases. At the word level, Chinese is more complex and more difficult than English. For example, Chinese word segmentation must consider granularity: the larger the granularity, the more accurate the expressed meaning, but the lower the recall. Therefore, different scenarios require different segmentation granularity. Common Chinese word segmentation tools include HanLP, jieba, and Stanford tokenization, while Gensim, NLTK, and Keras are often used for English tokenization. In this paper, jieba is primarily used as the tool for Chinese word segmentation.
CSSWR covers not only common Chinese stop words such as pronouns, prepositions, conjunctions, interjections, and onomatopoeias, but also meaningless Chinese phrases. In our dataset, specific phrases such as "please (请问)" and "hello (你好)" are identified as situational language and treated as stop words.
CSSR refers to punctuation specific to Chinese. For example, "《》" marks a book or article title in Chinese; it has no English counterpart, since book or newspaper titles in English are italicized or underlined. "、" is the enumeration comma, which separates parallel elements within a sentence; English has no such sign and instead uses ordinary commas for parallel elements. Chinese also has a separator dot "·", used in the middle of items that need to be separated, such as a month and date or a transliterated first and last name.
All combinations of the Chinese text preprocessing methods considered in this paper are listed in Table 2. Word segmentation is represented as WS, Chinese specific stop word removal as CSSWR, and Chinese specific symbol removal as CSSR.
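To make the three methods concrete, the following minimal Python sketch shows how WS, CSSWR, and CSSR could be combined in any of the configurations of Table 2. It is an illustration only: the stop word set and symbol pattern shown are small placeholders rather than the actual lists used in our experiments.

```python
# Minimal sketch of the three preprocessing steps (WS, CSSWR, CSSR) using jieba.
# The stop word set and symbol pattern are illustrative placeholders, not the
# full lists used in our experiments.
import re
import jieba

STOP_WORDS = {"的", "是", "都", "请问", "你好"}          # example stop words / situational phrases
CHINESE_SYMBOLS = r"[《》、。，！？：；（）,?!\s]+"        # example Chinese-specific symbols

def preprocess(text, ws=True, csswr=True, cssr=True):
    """Apply any combination of WS, CSSWR, and CSSR to one Chinese sentence."""
    if cssr:                                             # CSSR: strip Chinese-specific symbols
        text = re.sub(CHINESE_SYMBOLS, " ", text)
    tokens = jieba.lcut(text) if ws else list(text)      # WS: word segmentation (else character split)
    if csswr:                                            # CSSWR: drop stop words and empty tokens
        tokens = [t for t in tokens if t.strip() and t not in STOP_WORDS]
    return tokens

print(preprocess("你好, 想问一下出口退免税备案表怎么填写?"))
```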

Dataset description
The first dataset used in this paper is collected from our project. It primarily contains user questions about the tax field and the corresponding scenario categories. We mapped the category labels to numbers for model training. Most of these questions are interrogative sentences, for example "Hello, could I ask about how to fill in the export tax refund filing form? (你好, 想问一下出口退免税备案表怎么填写?)". The scenario category is determined by the data provider. Furthermore, one of the most prominent features of this dataset is that it contains a large number of daily expressions, for example "Hello (你好)", "Could I ask?", "Could I ask? (咨询一下)", and "Could I ask? (方便问一下)". Such daily expressions carry no useful meaning for the classification task; therefore, they are defined as stop words in this study.
In addition, another feature of this dataset is the inclusion of technical terms, for example "tax refund (退免税)" in "Hello, could I ask about how to fill in the export tax refund filing form? (你好, 想问一下出口退免税备案表怎么填写?)", or "differential taxation (差额征税)" in "How to calculate the differential tax for a taxpayer who provides travel services? (纳税人提供旅游服务选择差额征税怎么计算?)". Without the relevant background knowledge, such professional words may cause incorrect sentence breaking in Chinese, thus affecting the accuracy of sentence understanding. The last feature is the special punctuation in this dataset, for example "《》" in "How to fill in the contact person on the "Report on Cross-Region Tax-related Matters"? (《跨区域涉税事项报告表》上联系人如何填写?)". In Chinese, "《》" is often used to refer to specific names such as books or articles, and in this dataset it refers to tax-related file names or legal provisions. This type of notation does not exist in languages such as English, which uses double quotes to refer to specific names.
In our dataset, there are a total of 34474 samples across 56 classes; the details are shown in Tables 3 and 4. A subset of the dataset is used to train the network (27579 samples, 80%), and the remaining data (6895 samples, 20%) is used to test and validate the network: 3446 samples form the test set and 3449 samples form the validation set. Each of the training, test, and validation sets has two parts: sentence and label.
The second dataset we use is THUCNews [46], a public dataset. It was generated by filtering historical data from the Sina News RSS subscription channels from 2005 to 2011 and contains 200,000 news documents, all in UTF-8 plain text format; the details are shown in Table 5. Based on the original Sina news classification system, the Natural Language Processing Laboratory of Tsinghua University reintegrated the data into 10 candidate categories: finance, realty, stocks, education, science, society, politics, sports, games, and entertainment.
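For reference, a small sketch of the 80%/10%/10% train/test/validation split described above for the tax question data is given below. The sentences and labels shown are placeholders standing in for the actual data, and the random seed is arbitrary.

```python
# Sketch of the 80% / 10% / 10% train / test / validation split described above.
# `sentences` and `labels` are placeholders standing in for the tax question data.
from sklearn.model_selection import train_test_split

sentences = ["出口退免税备案表怎么填写", "纳税人提供旅游服务差额征税怎么计算"] * 10
labels    = ["export_refund", "differential_tax"] * 10

train_x, rest_x, train_y, rest_y = train_test_split(
    sentences, labels, test_size=0.2, stratify=labels, random_state=42)
test_x, val_x, test_y, val_y = train_test_split(
    rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=42)
print(len(train_x), len(test_x), len(val_x))   # 16 2 2 on the toy data
```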

Feature extraction
In our study, the text features extracted differ by classifier type. For the simple machine learning classifiers, the Term Frequency-Inverse Document Frequency (TF-IDF) method is chosen. TF-IDF [47] weighs how important a word is to a particular document within a corpus. Since there are different ways of defining term frequency, we use the raw frequency of a term in a document:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j},   idf_i = log( |D| / |{ j : t_i ∈ d_j }| ),   tfidf_{i,j} = tf_{i,j} × idf_i,

where n_{i,j} is the number of occurrences of word t_i in document d_j, Σ_k n_{k,j} is the total number of word occurrences in the document, |D| is the total number of documents in the corpus, and |{ j : t_i ∈ d_j }| is the number of documents containing the word. For the deep learning classifiers, word vectors are trained with word2vec, which takes tokenized text and derives a feature vector for each word type in the dataset. In this paper we use the continuous skip-gram model, a neural network model that avoids multiple hidden layers and therefore allows extremely fast and efficient training compared with most classification algorithms. During training, each word in the dataset is used as input to a log-linear classifier, which learns word representations by trying to predict the words occurring within a certain range on either side of the input word. The skip-gram model is used by default, with a training complexity of

Q = M × (N + N × log₂ Y),

where M is the maximum distance between words, N is the dimensionality of the word representations, and Y is the output (vocabulary) dimensionality. For the pretrained representation models, a fine-tuning step is needed, and the model's tokenizer must be applied to the input text, especially when the input text has been preprocessed.
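As an illustration of the two feature-extraction routes just described, the sketch below derives TF-IDF vectors with scikit-learn and skip-gram word2vec vectors with gensim (version 4 or later is assumed); the tokenized sentences and all parameter values are illustrative placeholders rather than our exact experimental settings.

```python
# Sketch of the two feature-extraction routes: TF-IDF for the machine learning
# classifiers and skip-gram word2vec for the deep learning classifiers.
# Tokenised sentences and parameter values are illustrative placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec          # gensim >= 4 assumed (vector_size argument)

tokenised = [["出口", "退免税", "备案表", "填写"],
             ["纳税人", "旅游", "服务", "差额", "征税", "计算"]]

# 1) TF-IDF features: the analyzer callback passes the pre-segmented tokens straight through.
tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)
x_tfidf = tfidf.fit_transform(tokenised)

# 2) Skip-gram word2vec features (sg=1 selects the skip-gram model); here a sentence
#    vector is taken as the mean of its word vectors.
w2v = Word2Vec(sentences=tokenised, vector_size=100, window=5, min_count=1, sg=1)
x_w2v = np.array([np.mean([w2v.wv[t] for t in sent], axis=0) for sent in tokenised])
```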

Experimental configuration
The operating system used in this experiment was Ubuntu 20.04.2 LTS, the programming language was Python 3.7, the deep learning framework was PyTorch, the graphics card was a single Nvidia GeForce RTX 3070 with 8 GB of memory, and the CUDA version was 11.0. The optimizer selected was Adam. Given a dataset as input, Python's NLTK was used, and a new file was created as output for each preprocessing technique. For the WS, CSSWR, and CSSR preprocessing methods in the general domain, we primarily used the jieba tool. Additionally, we incorporated manually curated resources, including specialized vocabulary lists, stop word lists, and special character lists tailored to the field of taxation.
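The domain customization mentioned above can be wired into jieba roughly as follows; the file names are placeholders for the tax-domain vocabulary and stop word lists we curated, and the snippet is only a sketch of the loading step.

```python
# Sketch of plugging the curated tax-domain resources into jieba.
# The file names are placeholders for our manually curated lists.
import jieba

jieba.load_userdict("tax_terms.txt")        # one domain term per line, e.g. 退免税, 差额征税

with open("tax_stopwords.txt", encoding="utf-8") as f:
    tax_stop_words = {line.strip() for line in f if line.strip()}

tokens = [t for t in jieba.lcut("纳税人提供旅游服务选择差额征税怎么计算?")
          if t not in tax_stop_words]
```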

Classifiers
Among the many available text classifiers, we investigated fifteen popular machine learning, deep learning, and pretrained language models from the last five years, as follows:
CNN-BiLSTM-Self-Attention [37]. Integrates CNN and BiLSTM semantic features through a self-attention mechanism.
WCAM [39]. Integrates attention at two levels: a word-level attention model captures salient words, and a character-level attention model selects discriminative Chinese characters.
Syntax-GCN [40]. A syntax-based graph convolution network model that enhances the understanding of the diverse grammatical structures of Chinese microblogs.
RoBERTa [43]. Adopted and fine-tuned for Chinese text classification; the model classifies Chinese texts into two categories, containing descriptions of legal and illegal behavior.
ChineseBERT [44]. Incorporates both the glyph and pinyin information of Chinese characters into a pretrained language model.
Logistic Regression (LR). A popular algorithm that belongs to the Generalized Linear Model family and is also known as Maximum Entropy [48].
Naive Bayes (NB). A simple but powerful linear classifier, often applied with TF-IDF features, and a standard choice for Chinese information classification [49].
Support Vector Machines (SVM). SVM, often with task-specific optimizations, is widely applied to resolve text classification problems [10].
Random Forest (RF). Operates by constructing a multitude of decision trees at training time and outputs the class for the case at hand [32].
DPCNN [50]. A word-level network that extracts long-distance text dependencies by deepening the network.
FastText [51]. Represents statements with bags of words and N-grams, uses subword information, and shares information between categories through hidden representations.
Transformer [52]. Consists of self-attention and feed-forward neural networks; a trainable network can be built by stacking transformer blocks.
BERT [53]. A pretrained language model with strong language representation and feature extraction ability.
ERNIE [54]. A pretrained language model based on BERT for Chinese corpus learning, with three levels of masking: word, phrase, and entity.
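To illustrate how the pretrained models in this list are applied, the sketch below fine-tunes a Chinese BERT checkpoint with the HuggingFace transformers library for the 56-class tax question task; the checkpoint name, labels, and hyperparameters are illustrative and do not reproduce our exact experimental setup.

```python
# Minimal fine-tuning sketch for a pretrained Chinese model (one optimisation step shown).
# Checkpoint name, labels, and hyperparameters are illustrative placeholders.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=56)

texts  = ["出口退免税备案表怎么填写", "纳税人提供旅游服务选择差额征税怎么计算"]
labels = torch.tensor([3, 17])                      # placeholder class indices

batch = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors="pt")
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)             # forward pass returns the cross-entropy loss
outputs.loss.backward()
optimizer.step()
```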

Experiments and results
The performance of the experiments is evaluated with the following metrics:

Precision = TP / (TP + FP),   Recall = TP / (TP + FN),

F1 = 2 × Precision × Recall / (Precision + Recall),   macro-F1 = (1 / N) × Σ_{i=1}^{N} F1_i,

where TP denotes sentences labeled as positive and predicted as positive, TN denotes sentences labeled as negative and predicted as negative, FP denotes sentences labeled as negative but predicted as positive, FN denotes sentences labeled as positive but predicted as negative, and N is the number of classes; the macro-averaged score is the unweighted mean of the per-class F1 scores.
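In practice these metrics can be computed with scikit-learn, as in the following sketch; the label vectors shown are placeholders.

```python
# Sketch of the evaluation: macro-averaged precision, recall, and F1 with scikit-learn.
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 2, 1, 1, 0, 2]        # gold labels (placeholder values)
y_pred = [0, 2, 1, 0, 0, 1]        # model predictions (placeholder values)

macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"macro-P={macro_p:.4f}  macro-R={macro_r:.4f}  macro-F1={macro_f1:.4f}")
```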
During the experiments, we explored all possible combinations of the three preprocessing methods. Fifteen classifiers were investigated in this work.
Table 6 shows the classification performance of the four simple machine learning models in combination with different preprocessing methods on the tax question and THUCNews datasets. The results indicate that the models' classification performance improved after combining them with the corresponding preprocessing methods; the bolded results represent the best results for the two datasets. The best preprocessing method and model combination for the tax question dataset was LR with CSSWR, which attained a macro-F1 of 91.00%, 1.97% better than without the preprocessing method, and the best combination for the THUCNews dataset was SVM with CSSWR and CSSR, which attained a macro-F1 of 89.37%, 1.17% higher than without the preprocessing method.
Similarly, Tables 7 and 8 present the classification performance of the seven deep learning models after combining different preprocessing methods on the tax question and THUCNews datasets. Once again the results indicate that classification performance improved after the corresponding preprocessing methods were applied, and the bolded results represent the best results for the two datasets. The best preprocessing method and deep learning model combination for the tax question dataset was CNN-BiLSTM-Self-Attention with CSSWR, which attained a macro-F1 of 91.03%, 1.15% better than without the preprocessing method, and the best combination for the THUCNews dataset was FastText with WS, CSSWR, and CSSR, which attained a macro-F1 of 92.13%, a 0.3% improvement over not using the preprocessing.
Next, Table 9 shows the classification performance of the four pretrained models in combination with different preprocessing methods on the tax question and THUCNews datasets, where the results again indicate that classification performance improved after the corresponding preprocessing methods were applied, and the bolded results represent the best results for the two datasets. For the tax question dataset, the best preprocessing method and pretrained model combination was ChineseBERT with CSSWR and CSSR, which attained a macro-F1 of 91.99%, a 2% increase over the case without the preprocessing method, and the best combination for the THUCNews dataset was also ChineseBERT but with CSSWR, which attained a macro-F1 of 92.01%, 1.61% higher than without the preprocessing method. From the above results, we conclude that the best preprocessing method and model combinations for the two datasets were FastText with WS, CSSWR, and CSSR for THUCNews, and ChineseBERT with CSSWR and CSSR for the tax question dataset.

For THUCNews, the best-performing preprocessing combination demonstrates that correct Chinese word segmentation helps the model understand the text better, enabling it to extract richer lexical-semantic information; dividing sentences into words helps the model differentiate the meanings of different words and thereby improves classification accuracy. For instance, an original text sample may be a piece of consecutive Chinese text, such as "Last night's soccer game was very exciting and both teams performed well. (昨晚的足球比赛非常精彩, 双方球队都表现出色。)". After word segmentation, it is divided into a series of words, such as "last night (昨晚)", "of (的)", "soccer (足球)", "game (比赛)", "very (非常)", "wonderful (精彩)", ",", "both sides (双方)", "teams (球队)", "both (都)", "performance (表现)", "outstanding (出色)", and "。". After segmentation, each word becomes a feature of the model input. Removing stop words then reduces the model's attention to common but semantically empty words, reducing the influence of noise so that important keywords occupy a higher weight in the textual representation, which helps the model capture the important features of the text. That is, stop words such as "the (的)", "is (是)", and "both (都)" may carry little distinguishing power in the categorization of sports news; by removing them, we obtain a more refined feature sequence such as "last night (昨晚)", "soccer (足球)", "game (比赛)", "very (非常)", "wonderful (精彩)", ",", "both sides (双方)", "teams (球队)", "performance (表现)", "outstanding (出色)", "。". Furthermore, removing special characters makes the sentence cleaner and clearer, which helps the model understand the syntactic structure and logic of the sentence more accurately and avoid confusion. For example, by removing the special characters and punctuation, we obtain a further cleaned-up feature sequence: "last night (昨晚)", "soccer (足球)", "game (比赛)", "very (非常)", "wonderful (精彩)", "both sides (双方)", "team (球队)", "performance (表现)", "outstanding (出色)". These words focus on sports-related content, with punctuation that might introduce interference removed. Therefore, reasonable Chinese word segmentation, stop word removal, and special character removal can each improve performance on general-domain text categorization tasks. These preprocessing methods help a model understand the text better, capture key information, and reduce noise interference, thus improving classification results.

For the best preprocessing method and model combination on the tax question data, stop word removal and special character removal helped to remove noise, improve text clarity, highlight key information, and avoid ambiguity for ChineseBERT in a jargon-heavy setting. The preprocessing methods help the model understand the meaning of the tax text, which improves the accuracy and interpretability of the text categorization. For example, the original text "According to tax regulations, taxpayers are required to complete tax declaration and payment by the end of each month. (根据税务规定, 纳税人需在每月底前完成税款申报和缴纳。)" becomes "Tax regulations require taxpayers to complete their tax returns for payment at the end of each month. (税务规定, 纳税人需每月底完成税款申报缴纳)" after removing stop words. Removing stop words such as "according to (根据)" and "need to (需)" retains the important action and time information, making the sentence more concise and highlighting key information. Likewise, the original text "According to the second paragraph of Article 5 of the Value-added Tax Law, the sales of goods include sales revenue, taxes, surcharges, etc. (根据《增值税法》第五条第二款规定, 货物的销售额包括销售收入、税金、附加费等)" becomes "According to article 5, paragraph 2, of the VAT Law, the sales of goods include sales revenue, taxes, surcharges, etc. (根据增值税法第五条第二款规定, 货物的销售额包括销售收入、税金、附加费等)" after removing special characters. The removal of the title marks "《》" makes the citation of the statute clearer and less cluttered by special characters, which helps the model understand the provision more accurately.

Evaluation analysis
In this section, the macro-F1 results attained by all 7 combinations of the preprocessing methods are measured to assess the influence of preprocessing on the tax question and THUCNews datasets.
The highest macro-F1 among all types of classifiers and the corresponding preprocessing combinations are displayed in the following figures. Across all models, the difference between the maximum and minimum macro-F1 over all preprocessing combinations ranges from 0.01% to 3.28%. For the machine learning models, the difference in macro-F1 lies between 0.01% and 1.17% for THUCNews, and between 0.01% and 1.97% for the tax question dataset (Fig 2). For the deep learning models, the macro-F1 improvements range from 0.01% to 3.28% for THUCNews and from 0.04% to 1.03% for the tax question dataset (Fig 3). Likewise, the macro-F1 of the pretrained language models increases by at least 0.01% and up to 2.00% for THUCNews, and by at least 0.01% and up to 1.76% for the tax question dataset (Fig 4). The magnitude of these macro-F1 differences shows that choosing a preprocessing combination suited to the classifier can meaningfully improve the classification effect.

Case study
We conducted a case study of the preprocessing methods for short text classification in the field of taxation, and Fig 5 shows the entire workflow. First, the original text is processed by the seven combinations of preprocessing methods; then feature extraction is carried out in three ways, namely TF-IDF, word2vec, and feature representation and extraction based on a pretrained language model. After that, feature extraction is carried out in

Conclusion
In this paper, we experimentally compared fifteen commonly used classifiers on two Chinese datasets, THUCNews and the tax question dataset, employing three widely used Chinese preprocessing methods: WS, CSSWR, and CSSR. From the experimental results and discussion, we come to the following conclusions. We conducted a battery of additional experiments and found that most of the classifiers improved in performance after proper preprocessing was applied. Our general conclusion is that the systematic use of preprocessing methods can have a positive impact on the classification of Chinese short text, across evaluation metrics such as macro-F1, combinations of preprocessing methods such as word segmentation with Chinese-specific stop word and symbol removal, and classifier types including machine learning and deep learning models. We find that the best combinations of preprocessing and classification models for THUCNews and the tax domain are FastText with WS, CSSWR, and CSSR, and ChineseBERT with CSSWR and CSSR, respectively. The macro-F1 scores of these combinations were 92.13% and 91.99% on our tested data, which represent improvements of 0.3% and 2%, respectively, over FastText and ChineseBERT without preprocessing.
Our work provides a detailed analysis of the influence of preprocessing methods on Chinese classifiers, fills a gap in the in-depth exploration of the factors that influence Chinese classifiers, and draws attention to the preprocessing methods used during Chinese classification. However, this work still has limitations, such as the limited comprehensiveness of the preprocessing methods considered. Subsequent work will build on this foundation and provide a more detailed delineation of Chinese preprocessing methods and classifiers, in order to make the findings more convincing.

Table 2 . Combinations of Chinese preprocessing methods.
Word segmentation is represented as WS, Chinese specific stop word removal as CSSWR, and Chinese specific symbol removal as CSSR. Each preprocessing method has two statuses, true (T) and false (F): T means the method is applied and F means the text is processed without it.

Table 9. Macro-F1 results comparison of four widely used pretrained language models under seven combinations of preprocessing methods (TQ: tax question dataset; TC: THUCNews).